Algorithms in bioinformatics course project: main analysis

Lovisa Franzén, lovisa.franzen@scilifelab.se
October, 2019

The data

Budesonide, sold by AstraZeneca under the brand names Pulmicort and Symbicort, is a glucocorticoid used for long term management of asthma and chronic obstructive pulmonary disease (COPD). The compound can be administered through inhalation or as a pill, among other forms. It was sold for the first time as medication in 1981, and has since held a large part of the consumption market.

In a publication by Barrette et al. (2016)1, the authors studied the effect of budesonide treatment in primary human foetal lung explants. They prepared RNA from 3 samples of foetal lung at 23 weeks gestation before (preculture, PC) and after 4 days culture as explants with (Bud) or without (Way) budesonide (30 nM) and performed RNAseq on the 9 samples. The sequencing was performed using Illumina HiSeq 2500, and the processed data is publicly available at Gene Expression Omnibus (accession number GSE83888).


[1] Anne Marie Barrette, Jessica K. Roberts, Cheryl Chapin, Edmund A. Egan, Mark R. Segal, Juan A. Oses-Prieto, Shreya Chand, Alma L. Burlingame, and Philip L. Ballard, "Antiinflammatory Effects of Budesonide in Human Fetal Lung", *American Journal of Respiratory Cell and Molecular Biology*, 2016, link to article

Set-up

First I set everything up by importing the necessary modules, defining paths, and defining basic functions

In [1]:
import os
import re
import json
import numpy as np
from numpy import random
import pandas as pd

from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

import networkx as nx
print(nx.__version__)
2.4
In [2]:
dir_project = os.path.realpath('..')
dir_data = os.path.join(dir_project, 'data')
dir_res = os.path.join(dir_project, 'results')
In [3]:
def grep(l, p):
    return [i for i in l if p in i]

Read data

The data files are located in the data/ folder. To limit the scope of this project, I've decided to exclusively focus on the 'Way' (without treatment) and 'Bud' (with budesonide treatment) groups.

In [4]:
data_files = os.listdir(dir_data)
data_files
Out[4]:
['GSM2220787_PC_2.txt',
 'GSM2220786_PC_1.txt',
 'GSM2220784_Bud_2.txt',
 'GSM2220789_Way_1.txt',
 'GSM2220790_Way_2.txt',
 'GSM2220785_Bud_3.txt',
 'GSM2220788_PC_3.txt',
 'GSM2220791_Way_3.txt',
 'GSM2220783_Bud_1.txt']
In [5]:
exp_way_1 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[0]) , sep = '\t')
exp_way_2 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[1]) , sep = '\t')
exp_way_3 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Way')[2]) , sep = '\t')
exp_bud_1 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[0]) , sep = '\t')
exp_bud_2 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[1]) , sep = '\t')
exp_bud_3 = pd.read_csv( os.path.join(dir_data, grep(data_files, 'Bud')[2]) , sep = '\t')
In [6]:
exp_way = pd.merge(exp_way_1, exp_way_2, on='Gene').merge(exp_way_3, on='Gene')
exp_bud = pd.merge(exp_bud_1, exp_bud_2, on='Gene').merge(exp_bud_3, on='Gene')
In [7]:
exp_all = pd.merge(exp_way, exp_bud, on='Gene')
exp_all = exp_all.rename(columns={'Gene': 'id', 
                                  'Way_1.cpm': 'way_1',
                                  'Way_2.cpm': 'way_2',
                                  'Way_3.cpm': 'way_3',
                                  'Bud_1.cpm': 'bud_1',
                                  'Bud_2.cpm': 'bud_2',
                                  'Bud_3.cpm': 'bud_3',
                                 })

print(exp_all.shape)
exp_all[:10]
(83909, 7)
Out[7]:
id way_1 way_2 way_3 bud_2 bud_3 bud_1
0 ENSG00000000003 127.0126 98.2498 111.1196 77.1369 78.8440 79.1299
1 ENSG00000000005 0.0825 0.0462 0.0000 0.0761 0.0000 0.0000
2 ENSG00000000419 59.1461 56.0107 61.0911 62.1508 60.0492 52.3385
3 ENSG00000000457 14.9875 13.2633 9.9430 13.4647 12.4601 10.5766
4 ENSG00000000460 11.1736 11.6458 7.2557 9.3568 7.2247 7.6602
5 ENSG00000000938 1.5874 4.9911 1.7915 15.5187 6.0206 4.9383
6 ENSG00000000971 280.8461 205.6961 250.5901 159.6369 135.4902 100.4775
7 ENSG00000001036 57.0846 48.6165 60.6433 47.4689 54.8139 47.9834
8 ENSG00000001084 27.6455 35.4457 36.7711 34.6888 41.9873 32.5463
9 ENSG00000001167 88.9150 83.5077 76.4087 52.2614 50.7827 50.9386

Here we can see that the dataset contains a total of 83909 genes, and data collected from the six samples where three samples are without treatment ("way") and three are with budesonide treatment ("bud").

Take a look at the data

In [8]:
exp_all.describe()
Out[8]:
way_1 way_2 way_3 bud_2 bud_3 bud_1
count 83909.000000 83909.000000 83909.000000 83909.000000 83909.000000 83909.000000
mean 11.917672 11.917672 11.917674 11.917672 11.917675 11.917674
std 85.659276 81.590633 93.709144 93.142985 90.955367 110.890260
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.144300 0.138600 0.134400 0.152100 0.104700 0.155500
max 5790.587500 5375.872000 9180.690300 9657.065500 8089.049500 15064.851700

As one can see here, the counts have been normalised through CPM (counts-per-million) conversion and each sample has the same average gene count. I am however unsure if these counts are logged (log2). If that's the case, the differential gene expression analysis may be affected by it, so ideally the counts shouldn't be logged beforehand.

In [9]:
plot_data = exp_all.iloc[:,1:].melt()
g = sns.boxplot(x = "variable", y = "value", data = plot_data)
g.set_yscale("log")
plt.show();
In [10]:
# Visualize with a pair-plot to study correlations between groups
# Since this is a quite intensive plot, I'll only plot a subset of the data
n_subset = 2000
idx = np.random.choice(len(exp_all), n_subset, replace=False)

sns.pairplot(exp_all.iloc[idx,1:], diag_kind="kde", kind="reg");

The data is generally heavily skewed towards lower counts (positively skewed), as is expected for most RNA seq data. Clearer linear correlations in the expression can moreover be seen within each group, as compared to between groups.

Differential gene expression analysis

I couldn't find any modules in python that would perform differential expressions analysis (DEA). Among the ones I found, was to port the popular DESeq2 R algorithm using the rpy2 module, however I had issues to get it to work since it's reliant on your R installation.

Instead I built upon the code we worked with during the course. I reused Lukas' code for computing q-values, and then I wrapped it inside a function that I call dea(). This function performs student's t-test between the two groups for each gene, and also computes the log2 fold-change (log2(mean(post_group)/mean(pre_group))) for the genes. This is a very simplified way of doing DEA, but it seems to work somewhat well.

In [11]:
def dea(expr_df, n_replicates = 3, id_col = "id"):
    """
    :param expr_df: Pandas dataframe containing genes as rows 
    and unique samples as columns. Currently limited to comparisons 
    between two groups, where the columns belonging to samples 
    within each group is next to one another. Currently, the fold-change is 
    computed as the mean of group 2 divided by the mean of group 1, i.e. the 
    group in the first columns serves as the reference group.
    :param n_replicates: Number of samples or replicates within each group
    :param id_col: column name for the gene identifier
    """
    gene_ids = list(expr_df[id_col])
    expr_data = expr_df.drop(columns=[id_col])
    
    nrows_df = expr_df.shape[0]
    output_m = np.zeros((nrows_df, 3))
    
    for i in range(nrows_df):
        a = expr_data.iloc[i, :n_replicates]
        b = expr_data.iloc[i, n_replicates:]
        output_m[i, 0:2] = stats.ttest_ind(b, a)
        if np.isnan(output_m[i,1]):
            output_m[i,1] = 1
        
        a_avg = np.average(a)  # without treatment
        b_avg = np.average(b)  # with treatment
        
        # set 0-counts to non-zero to enable division and logging
        if a_avg == 0:
            a_avg = 0.0001
        if b_avg == 0:
            b_avg = 0.0001
            
        log2fc = np.log2( b_avg / a_avg) 
        output_m[i, 2] = log2fc
        
    
    output_a = pd.DataFrame(np.concatenate((output_m, np.array([gene_ids]).T), axis=1))
    output_a.rename(columns = {0:'t_statistic', 1:'p_value', 2:'log2fc', 3:'id'}, inplace=True)
    output_a[['t_statistic', 'p_value']] = output_a[['t_statistic', 'p_value']].astype(float)
    
    qvals = qvalues(np.array(output_a[['p_value', 'id']]))
    output_b = pd.DataFrame(qvals)
    output_b.rename(columns = {0:'q_value', 1:'p_value', 2:'id'}, inplace=True)
    
    output_df = pd.merge(expr_df, output_a, on='id')
    output_df = pd.merge(output_df, output_b.drop(columns=['p_value']), on='id')
    
    for col in ['t_statistic', 'p_value', 'log2fc']:
        output_df[col] = pd.to_numeric(output_df[col])
        
    output_df = output_df.sort_values(by=['p_value', 'q_value', 'log2fc'])
    
    return output_df


def bootstrap(invec):
    # Lukas' function
    idx = np.random.randint(0, len(invec), len(invec))
    return [invec[i] for i in idx]


def estimatePi0(p, numBoot=100, numLambda=100, maxLambda=0.95):
    # Lukas' function
    p.sort()
    n = len(p)
    lambdas = np.linspace(maxLambda / numLambda, maxLambda, numLambda)
    Wls = np.array([n - np.argmax(p >= l) for l in lambdas])
    pi0s = np.array([Wls[i] / (n * (1 - lambdas[i])) for i in range(numLambda)])
    minPi0 = np.min(pi0s)
    mse = np.zeros(numLambda)
    for boot in range(numBoot):
        pBoot = bootstrap(p)
        pBoot.sort()
        WlsBoot = np.array([n - np.argmax(pBoot >= l) for l in lambdas])
        pi0sBoot = np.array([WlsBoot[i] / (n *(1 - lambdas[i])) for i in range(numLambda)])
        mse = mse + np.square(pi0sBoot-minPi0)
    minIx = np.argmin(mse)
    return pi0s[minIx]


def qvalues(pvalues):
    """
    Modification of Lukas' qvalues function.
    Input: pvalues – should be in the format 
    of a numpy array and contain p-values in 
    the first column and sample id in the 
    second column.
    """
    m = len(pvalues)
    pvalues = pvalues[pvalues[:,0].argsort()]  # sort by p-vals
    #pvalues.sort()
    pi0 = estimatePi0([p for p, gene in pvalues])
    num_p, qs = 0.0, []
    for p, gene in pvalues:
        num_p += 1.0
        q = pi0 * p * m / num_p
        qs.append((q, p, gene))
    qs.reverse()
    old_q = 1.0
    for ix in range(len(qs)):
        q = min(old_q, qs[ix][0])
        old_q = q
        qs[ix] = (q, qs[ix][1], qs[ix][2])
    qs.reverse()
    return qs

First I'll do a pre-filtering of the data set before proceeding. In the article by the data owners, they performed a "prefiltering to exclude genes possessing ≤5 counts per million when summed over all experiments". However, I'd like to be even stricter in my filtering to ensure my gene set consists of a shorter list of genes which display greater differences. Also, since I do not include any shrinkage in my DEA (as compared to e.g. DESeq2) I think it'd better to ensure the genes I analyse have larger expression values.

I'll test to set the threshold to 20 instead of 5.

In [12]:
sum_cutoff = 20
exp_subset = exp_all.loc[(exp_all.iloc[:,1:].sum(axis=1) > sum_cutoff)]

print(exp_subset.shape)
exp_subset.sort_values(by="way_1")[:10]
(12918, 7)
Out[12]:
id way_1 way_2 way_3 bud_2 bud_3 bud_1
16623 ENSG00000183878 0.0206 0.1386 42.7728 0.1141 34.8673 0.0778
17734 ENSG00000188257 0.0412 0.9705 0.9853 51.0823 57.2221 9.9544
1027 ENSG00000067646 0.0412 0.0462 24.7231 0.0380 31.7784 0.0778
6437 ENSG00000131002 0.0618 0.2773 43.2206 0.0000 39.0032 0.0000
324 ENSG00000012817 0.0825 0.1386 67.0032 0.0761 83.9223 0.0389
48617 ENSG00000233864 0.0825 0.1849 11.9585 0.1902 16.0725 0.0389
5264 ENSG00000121933 0.1237 1.0629 0.7166 13.2746 6.7536 4.6273
11696 ENSG00000165246 0.1237 0.0000 29.0228 0.0761 14.5542 0.0389
51036 ENSG00000235847 0.1443 6.9782 11.1075 7.1888 6.5965 0.0000
3779 ENSG00000110077 0.1855 2.0796 1.8363 16.4696 9.9995 4.8217

Applying this cutoff gave me 12918 genes to work with.

Run the "differential gene expression analysis"

In [13]:
exp_dea = dea(exp_subset)
In [14]:
# Check the head of the results table from the DEA
exp_dea[:10]
Out[14]:
id way_1 way_2 way_3 bud_2 bud_3 bud_1 t_statistic p_value log2fc q_value
1662 ENSG00000096384 1680.5205 1724.4549 1762.3271 1224.4538 1238.8875 1236.6814 -20.328395 0.000035 -0.481877 0.067541
4811 ENSG00000130707 92.7289 95.2921 94.3688 54.3534 49.4215 55.9159 -19.516145 0.000041 -0.822407 0.067541
4275 ENSG00000124440 24.6562 25.9257 22.9764 276.4834 232.2914 258.6983 17.977735 0.000056 3.383156 0.067541
7151 ENSG00000152642 44.8389 35.2146 45.7288 108.8589 104.9683 102.6550 16.631593 0.000077 1.331195 0.067541
10834 ENSG00000185813 7.4216 9.6586 8.2410 28.7552 27.9566 31.6520 16.177972 0.000085 1.803110 0.067541
6516 ENSG00000144810 110.5820 110.4039 122.1823 44.9205 50.1021 43.9006 -15.672319 0.000097 -1.304628 0.067541
5664 ENSG00000137449 25.5633 28.1902 27.1417 43.3230 41.8826 41.1786 15.308214 0.000106 0.643690 0.067541
4966 ENSG00000132142 92.9969 87.7593 84.0675 134.3811 138.0555 132.0905 14.929577 0.000117 0.611204 0.067541
11143 ENSG00000189221 35.4382 34.0593 41.1604 368.0740 312.6013 307.1095 14.923340 0.000117 3.158090 0.067541
9923 ENSG00000176102 34.8197 35.8616 35.8754 27.5380 26.6478 28.1524 -14.429567 0.000134 -0.371987 0.067541
In [15]:
print(exp_dea['q_value'].describe(), "\n")
print(exp_dea['p_value'].describe() )
count    12918.000000
mean         0.288198
std          0.129168
min          0.067541
25%          0.179709
50%          0.268216
75%          0.397690
max          0.532184
Name: q_value, dtype: float64 

count    12918.000000
mean         0.340425
std          0.293172
min          0.000035
25%          0.084443
50%          0.252006
75%          0.560502
max          0.999886
Name: p_value, dtype: float64

Taking a quick look at the obtained p- and q-values, we can see that the lowest q-value however quite high (>0.05) which is probably due to the very large number of tests being performed.

In [16]:
# Some QC plots
f, axes = plt.subplots(2, 2, figsize=(16, 8))

sns.distplot(exp_dea['p_value'], bins = 100, kde = True, ax = axes[0,0])
sns.scatterplot(data = exp_dea, x = "p_value", y = "q_value", linewidth = 0, alpha = .1, ax = axes[0,1])
sns.scatterplot(data = exp_dea, x = "q_value", y = range(len(exp_dea["q_value"])), linewidth = 0, alpha = .1, ax = axes[1,0])
axes[1,0].set_xlim(0, 0.2)
axes[1,0].set_ylim(0, 5e3)
sns.scatterplot(data = exp_dea, x = "p_value", y = range(len(exp_dea["p_value"])), linewidth = 0, alpha = .1, ax = axes[1,1])
axes[1,1].set_xlim(0, 0.1)
axes[1,1].set_ylim(0, 5e3)

sns.despine()
plt.tight_layout();

Here we can see the p-value distribution, where it is also apparent that there is a rather smooth gradient between the

In [17]:
np.log2(1.5)
Out[17]:
0.5849625007211562
In [18]:
exp_dea["-log10_pval"] = -np.log10(exp_dea["p_value"])
exp_dea["abs_log2fc"] = np.abs(exp_dea['log2fc'])
exp_dea["cutoff_01"] = (exp_dea['abs_log2fc'] >= np.log2(1.5)) & (exp_dea['p_value'] < 0.01)
exp_dea["cutoff_001"] = (exp_dea['abs_log2fc'] >= 2) & (exp_dea['p_value'] < 0.001)
In [19]:
plt.figure(figsize=(16, 6))
sns.scatterplot(data = exp_dea, x = "log2fc", y = "-log10_pval", 
                linewidth = 0, alpha = .3, hue = "cutoff_01");

Pathways from differentially expressed genes

In order to create a network of pathways, I first need to identify which pathways that my differentially expressed genes (DEGs) belong to.

The steps that I need to take here can be broken down to:

  1. Identify DEGs satisfying specified cut-offs (p-value and/or log-FC)
  2. Identify pathways to which each DEG is part of
  3. Format the information in a way which can be converted into a adjacency matrix/list

Identify DEGs

In [20]:
p_cutoff = 0.01
fc_cutoff = np.log2(1.5)

degs_df = exp_dea.loc[(exp_dea['p_value'] < p_cutoff),]

degs_df.describe()
Out[20]:
way_1 way_2 way_3 bud_2 bud_3 bud_1 t_statistic p_value log2fc q_value -log10_pval abs_log2fc
count 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000 634.000000
mean 86.066323 84.137416 86.397688 102.043599 96.918927 97.495674 0.228054 0.004427 0.228489 0.087740 2.513640 1.025992
std 160.051202 156.013818 166.034344 208.595503 175.656739 191.779066 7.021970 0.002904 1.332766 0.013319 0.455159 0.879905
min 0.329800 0.415900 0.313500 0.532500 0.366500 0.699900 -20.328395 0.000035 -3.619533 0.067541 2.001013 0.126441
25% 14.245400 14.418600 12.977400 16.393475 16.386550 15.641300 -5.686180 0.001793 -0.624067 0.076195 2.168642 0.482459
50% 39.994250 41.661450 39.727200 42.923600 42.929700 42.636800 4.620026 0.004344 0.173226 0.092634 2.362128 0.762839
75% 96.305725 95.950700 95.936350 113.879650 106.133125 107.039250 6.018753 0.006782 0.924330 0.097521 2.746388 1.240572
max 1965.531000 1892.671900 2043.104500 2864.983000 1627.034600 1896.473000 17.977735 0.009977 6.898735 0.108194 4.461235 6.898735
In [21]:
degs_df = degs_df.loc[(exp_dea['abs_log2fc'] >= fc_cutoff),]

degs_df.describe()
Out[21]:
way_1 way_2 way_3 bud_2 bud_3 bud_1 t_statistic p_value log2fc q_value -log10_pval abs_log2fc
count 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000
mean 59.576803 58.195367 59.699064 94.000176 85.796286 88.292151 1.350046 0.004181 0.419385 0.086631 2.547063 1.374560
std 98.646103 93.969338 105.284870 218.984039 170.494279 192.891788 7.060949 0.002866 1.606891 0.013352 0.463886 0.929697
min 0.329800 0.415900 0.313500 0.532500 0.366500 0.699900 -19.516145 0.000041 -3.619533 0.067541 2.002810 0.586218
25% 9.380100 9.681700 8.509800 9.936900 9.855500 9.847500 -5.548827 0.001624 -0.865127 0.074971 2.188345 0.776151
50% 27.367200 25.810200 26.335500 33.566750 34.396100 36.220850 4.877901 0.003920 0.717641 0.092126 2.406756 1.049050
75% 69.953825 70.902950 69.768825 92.779250 88.031975 96.132150 6.617856 0.006481 1.270766 0.097446 2.789353 1.562774
max 1014.431000 812.940800 975.532300 2864.983000 1627.034600 1896.473000 17.977735 0.009935 6.898735 0.108042 4.390988 6.898735
In [22]:
degs = degs_df["id"]
degs
Out[22]:
4811     ENSG00000130707
4275     ENSG00000124440
7151     ENSG00000152642
10834    ENSG00000185813
6516     ENSG00000144810
              ...       
5747     ENSG00000137962
955      ENSG00000071967
3865     ENSG00000119686
3603     ENSG00000116729
11500    ENSG00000197991
Name: id, Length: 406, dtype: object

Find pathways for DEGs

I found a python module called mygene, which is a wrapper to access data from the MyGene.info services.

In [23]:
import mygene
In [24]:
mg = mygene.MyGeneInfo()
In [25]:
f = "name, symbol, pathway.reactome"  # "all"
degs_mygene = mg.getgenes(degs, fields = f, as_dataframe = False)
querying 1-406...done.
In [26]:
# Look at structure of mygene output:
degs_mygene[:2]
Out[26]:
[{'query': 'ENSG00000130707',
  '_id': '445',
  '_score': 19.113655,
  'name': 'argininosuccinate synthase 1',
  'pathway': {'reactome': [{'id': 'R-HSA-1430728', 'name': 'Metabolism'},
    {'id': 'R-HSA-70635', 'name': 'Urea cycle'},
    {'id': 'R-HSA-71291',
     'name': 'Metabolism of amino acids and derivatives'}]},
  'symbol': 'ASS1'},
 {'query': 'ENSG00000124440',
  '_id': '64344',
  '_score': 20.330833,
  'name': 'hypoxia inducible factor 3 subunit alpha',
  'pathway': {'reactome': [{'id': 'R-HSA-1234158',
     'name': 'Regulation of gene expression by Hypoxia-inducible Factor'},
    {'id': 'R-HSA-1234174', 'name': 'Cellular response to hypoxia'},
    {'id': 'R-HSA-1234176',
     'name': 'Oxygen-dependent proline hydroxylation of Hypoxia-inducible Factor Alpha'},
    {'id': 'R-HSA-1266738', 'name': 'Developmental Biology'},
    {'id': 'R-HSA-2262752', 'name': 'Cellular responses to stress'},
    {'id': 'R-HSA-392499', 'name': 'Metabolism of proteins'},
    {'id': 'R-HSA-452723',
     'name': 'Transcriptional regulation of pluripotent stem cells'},
    {'id': 'R-HSA-597592', 'name': 'Post-translational protein modification'},
    {'id': 'R-HSA-8951664', 'name': 'Neddylation'},
    {'id': 'R-HSA-8953897',
     'name': 'Cellular responses to external stimuli'}]},
  'symbol': 'HIF3A'}]
In [27]:
degs_mygene_df = pd.DataFrame(degs_mygene)
degs_mygene_df[:5]
Out[27]:
query _id _score name pathway symbol notfound
0 ENSG00000130707 445 19.113655 argininosuccinate synthase 1 {'reactome': [{'id': 'R-HSA-1430728', 'name': ... ASS1 NaN
1 ENSG00000124440 64344 20.330833 hypoxia inducible factor 3 subunit alpha {'reactome': [{'id': 'R-HSA-1234158', 'name': ... HIF3A NaN
2 ENSG00000152642 23171 20.864864 glycerol-3-phosphate dehydrogenase 1 like {'reactome': [{'id': 'R-HSA-1430728', 'name': ... GPD1L NaN
3 ENSG00000185813 5833 20.073240 phosphate cytidylyltransferase 2, ethanolamine {'reactome': [{'id': 'R-HSA-1430728', 'name': ... PCYT2 NaN
4 ENSG00000144810 1295 19.690887 collagen type VIII alpha 1 chain {'reactome': [{'id': 'R-HSA-1442490', 'name': ... COL8A1 NaN

Some of the genes didn't seem to have any annotations in the MyGene database. These are reported in the column 'notfound', and will also have NaN in the 'pathway' column if no pathways were found to be associated with the gene.

In [28]:
degs_mygene_df.loc[degs_mygene_df.pathway.isnull()== True, ]
Out[28]:
query _id _score name pathway symbol notfound
5 ENSG00000137449 132864 20.077162 cytoplasmic polyadenylation element binding pr... NaN CPEB2 NaN
6 ENSG00000132142 NaN NaN NaN NaN NaN True
10 ENSG00000162613 8880 20.478895 far upstream element binding protein 1 NaN FUBP1 NaN
11 ENSG00000181773 2827 21.315294 G protein-coupled receptor 3 NaN GPR3 NaN
12 ENSG00000138162 10579 19.287067 transforming acidic coiled-coil containing pro... NaN TACC2 NaN
... ... ... ... ... ... ... ...
393 ENSG00000115295 79745 20.003197 CAP-Gly domain containing linker protein famil... NaN CLIP4 NaN
398 ENSG00000153066 51061 19.984604 thioredoxin domain containing 11 NaN TXNDC11 NaN
399 ENSG00000116299 57535 21.300198 KIAA1324 NaN KIAA1324 NaN
403 ENSG00000119686 55640 20.150984 FLVCR heme transporter 2 NaN FLVCR2 NaN
405 ENSG00000197991 ENSG00000197991 21.300198 NaN NaN AL592490.1 NaN

174 rows × 7 columns

Therefore, I want to filter away these genes before proceeding with building the pathway network.

In [29]:
# Filter: 
#  - 1: Remove genes that were not found in mygene
genes_rm_1 = degs_mygene_df.loc[(degs_mygene_df.notfound == True), "query"]

print(genes_rm_1, '\n')
6      ENSG00000132142
86     ENSG00000110347
134    ENSG00000182851
216    ENSG00000155130
223    ENSG00000129270
256    ENSG00000261088
279    ENSG00000234635
Name: query, dtype: object 

In [30]:
# Filter: 
#  - 2: Remove genes with no annotated pathways
genes_rm_2 = degs_mygene_df.loc[(degs_mygene_df.pathway.isnull()== True), "query"]

print(genes_rm_2, '\n')
5      ENSG00000137449
6      ENSG00000132142
10     ENSG00000162613
11     ENSG00000181773
12     ENSG00000138162
            ...       
393    ENSG00000115295
398    ENSG00000153066
399    ENSG00000116299
403    ENSG00000119686
405    ENSG00000197991
Name: query, Length: 174, dtype: object 

In [31]:
genes_rm = list(genes_rm_1.index)
genes_rm.extend(list(genes_rm_2.index))

print("Genes without pathways:", len(genes_rm))
Genes without pathways: 181
In [32]:
degs_mygene_df_filt = degs_mygene_df.drop(genes_rm)
degs_mygene_df_filt.shape
Out[32]:
(232, 7)
In [33]:
degs_mygene_df_filt = degs_mygene_df_filt.reset_index()

Out of the 399 filtered DEGs, 232 of them had pathways assigned to them

Next, I'll have to create a list of all unique reactome pathway IDs that I can use as a backbone for the network. Then, I'll populate an adjacency matrix by looking at each gene and connecting all the pathways that the gene belongs to. I'll also create a separate dictionary containg all pathway ids as keys and the number of genes annotated to that pathway as values.

In [34]:
pw_list = list(degs_mygene_df_filt["pathway"])
pw_ids = list()

for i in range(0,len(pw_list)):
    if type(pw_list[i]) == dict:  # do not consider genes which have no pathways assigned to it (should've been filtered)
        if type(pw_list[i]['reactome']) == list:  # append when a gene has multiple pws assigned
            ids = [d['id'] for d in pw_list[i]['reactome']]
            pw_ids.append(ids)
        elif type(pw_list[i]['reactome']) == dict:  # append when a gene has a single pw assigned
            ids = pw_list[i]['reactome']['id']
            pw_ids.append(ids)
        # print(i, ids[0])
In [35]:
def flatten(items):
    """
    Yield items from any nested iterable.
    Code found here: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
    """
    from collections import Iterable # < py38
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            for sub_x in flatten(x):
                yield sub_x
        else:
            yield x
In [36]:
pw_ids_flat = list(flatten(pw_ids)) # [item for sublist in pw_ids for item in sublist]
pw_ids_flat_unique = list(set(pw_ids_flat))

print("Total number of pathway IDs:", len(pw_ids_flat))
print("Number of unique pathway IDs:", len(pw_ids_flat_unique))
Total number of pathway IDs: 2576
Number of unique pathway IDs: 846
In [37]:
# Look at how the 'pw_ids' is structured
pw_ids[:3]
Out[37]:
[['R-HSA-1430728', 'R-HSA-70635', 'R-HSA-71291'],
 ['R-HSA-1234158',
  'R-HSA-1234174',
  'R-HSA-1234176',
  'R-HSA-1266738',
  'R-HSA-2262752',
  'R-HSA-392499',
  'R-HSA-452723',
  'R-HSA-597592',
  'R-HSA-8951664',
  'R-HSA-8953897'],
 ['R-HSA-1430728',
  'R-HSA-1483166',
  'R-HSA-1483206',
  'R-HSA-1483257',
  'R-HSA-556833']]

Create adjacency matrix

Create a weighted adjacency matrix. For every time a pathway shares a gene, it will increase it's weight by one. Thus, pathways which are commonly annotated together will show a high connectivity.

For every pathway, I'll also create a dictionary containing the number of occurances of that pathway (i.e. number of genes assigned to that pw).

In [38]:
pwd_list = pw_ids_flat_unique
n = len(pwd_list)
adj_matrix = np.zeros((n, n))
pw_dict = dict()

for pw in pwd_list:
    pw_count = 0
    pw_index = pwd_list.index(pw)
    
    all_pws = [d for d in pw_ids if pw in d]
    pw_dict[pw] = len(all_pws)
    for gene_pws in all_pws:
        if len(gene_pws[0])==1:
            # Handle cases of single pathways, such as "R-HSA-400253"
            gene_pw_index = pwd_list.index(gene_pw)            
            if pw_index != gene_pw_index:
                adj_matrix[pw_index, gene_pw_index] += 1
                adj_matrix[gene_pw_index, pw_index] += 1
            else:
                pw_count += 1
        else:
            for gene_pw in gene_pws:
                gene_pw_index = pwd_list.index(gene_pw)            

                if pw_index != gene_pw_index:
                    adj_matrix[pw_index, gene_pw_index] += 1
                    adj_matrix[gene_pw_index, pw_index] += 1
                else:
                    pw_count += 1

Look at adjacency matrix and pathway weights

In [39]:
print(adj_matrix.shape)
adj_matrix
(846, 846)
Out[39]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 3., 0., 0.],
       ...,
       [0., 0., 3., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [40]:
# Show first 10 items in the pw_dict dictionary
import itertools
print(len(pw_dict))
dict(itertools.islice(pw_dict.items(), 10))
846
Out[40]:
{'R-HSA-190840': 1,
 'R-HSA-6807047': 1,
 'R-HSA-8941856': 1,
 'R-HSA-180336': 1,
 'R-HSA-3928664': 3,
 'R-HSA-2644603': 1,
 'R-HSA-2022090': 4,
 'R-HSA-9634600': 1,
 'R-HSA-168138': 3,
 'R-HSA-74160': 23}

Plot to describe the pathway connectivity

In [41]:
# Look at pathway size (genes in pathway) distribution
plt.hist(pw_dict.values(), bins=100);
In [42]:
print("Median:", np.median(list(pw_dict.values())))
print("Average:", np.mean(list(pw_dict.values())))
Median: 1.0
Average: 2.6843971631205674

Most pathways seems to have <10 genes connected to it based on the provided DEG list. However, there seems to be a few pathways with a very large number of genes in them

In [43]:
# look at what pathways have the most genes in them
filt_cutoff = 25
d = {k: v for k, v in pw_dict.items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True) 
d
Out[43]:
[('R-HSA-162582', 68),
 ('R-HSA-1430728', 67),
 ('R-HSA-168256', 51),
 ('R-HSA-392499', 33),
 ('R-HSA-556833', 31),
 ('R-HSA-1280215', 27)]
In [44]:
# Look at pathway-pathway connectivity (adjacency matrix)
a = adj_matrix.sum(axis=1)  # [adj_matrix.sum(axis=1)!=0]  # if remove pathways with 0 connections
plt.scatter(a, range(0, len(a)), s = .5)
plt.xlabel("Total number of pathway occurances")
plt.ylabel("Index")
plt.show();

It seems most pathways doesn't have a lot of connections, while a few are very well connected with other pathways.

Save objects before proceeding with next steps

In [45]:
fname = "degs_mygene_df_filt.csv"
degs_mygene_df_filt.to_csv(os.path.join(dir_res, fname), index=False)
In [46]:
np.save(os.path.join(dir_res, "adj_matrix"), adj_matrix)
In [47]:
json.dump(pw_dict, open(os.path.join(dir_res, "pw_dict.txt"),'w'))
In [48]:
# Read objects:
degs_mygene_df_filt = pd.read_csv(os.path.join(dir_res, "degs_mygene_df_filt.csv"))
adj_matrix = np.load(os.path.join(dir_res, "adj_matrix.npy"))
pw_dict = json.load(open( os.path.join(dir_res, "pw_dict.txt") ))

Make pathway column in degs_mygene_df_filt searchable by setting it as a string (instead of dict), and create a function which outputs info regarding the searched pathway.

In [49]:
def identify_pathway(df = degs_mygene_df_filt, pw_search = "R-HSA-71291", print_genes = True):
    # import json
    df['pathway_str'] = df['pathway'].astype(str)
    a = df[df['pathway_str'].str.contains(pw_search)]
    a = a.reset_index()
    
    pw_string = a['pathway_str'][0].replace("'on'", 'on')  # there's a stupid case of 'on' within a string that messes things up, so here I'm removing the quotes around it
    a_dict = json.loads(pw_string.replace("'", '"')) 
    
    if len(pw_search) > 5:
        pw_info = next(item for item in a_dict['reactome'] if item["id"] == pw_search)
        pw_name = pw_info['name']
    else:
        pw_name = "NA"
    
    if print_genes:
        pw_genes_df = pd.DataFrame({'gene_id': a['query'],
                            'gene_symbol': a['symbol'],
                            'gene_name': a['name']})
        print(pw_search, ": ", pw_name, sep='')
        print('\n', 'Genes in pathway:', sep='')
        print(pw_genes_df) 
        return(pw_search, pw_name)
    else:
        return(pw_search, pw_name)

Build a network

In [50]:
G = nx.from_numpy_matrix(adj_matrix)
In [51]:
G.size()
Out[51]:
17606
In [52]:
# Create label dict
g_labels = dict(zip(list(range(0, len(pw_dict))), list(pw_dict.keys())))

Network metrics

First I'll look at some various metrics regarding the network:

In [53]:
# Network node degree, i.e. the number of edges that are connected to the node
degree_list = [d for n, d in G.degree]

plt.hist(degree_list, bins=100)
plt.show();

np.mean(degree_list)
Out[53]:
41.621749408983455
In [54]:
np.median(degree_list)
Out[54]:
26.0

On average, each pathway/node is connected to 41.6 other nodes. But most of the nodes seems to fall below a degree of <50, while a few nodes have a very high degree (>300). The median degree for each node is 26.

In [55]:
# Average shortest path
nx.average_shortest_path_length(G)
Out[55]:
2.330980457985368
In [56]:
# Network diameter, i.e the maximum shortest path length among any nodes in the network:
nx.diameter(G)
Out[56]:
5

One can traverse the network relatively fast it seems, since the network diameter is not more than 5 and the average shortest path in the network is around 2.3.

Cliques
A clique is a subset of nodes in the graph such that every two distinct nodes in the clique are adjacent. I can identify the maximum number of cliques in the network by using the find_cliques function. A maximum clique is one that is not a subset of any other larger clique.

In [57]:
cliques = nx.find_cliques(G)
len(list(cliques))
Out[57]:
631

Draw the network

Then let's have an actual look at the network graph:

In [58]:
nx.draw(G, node_size = 10, edge_color = 'gray', weight=1);

The network currently looks like a big ball of nodes, which I think makes sense for this kind of data.

To prune the network a bit, it can also be useful to look at the minimum spanning tree, where the network has been trimmed to it's core since we only keep the edges that connects all nodes with the minimum number of edges possible.

In [59]:
tree = nx.minimum_spanning_tree(G)
nx.draw(tree, node_size = 5, edge_color = 'gray', weight=1);

Refinement of the network visualization

To make the graph a bit more visually appealing, I want to draw it where the edge widths corresponds to their weight, and the node sizes corresponds to the number of genes sharing that pathway. I'll also throw in some coloring of the nodes according to their size.

In [60]:
edges = G.edges()
#colors = [G[u][v]['color'] for u,v in edges]
weights = [G[u][v]['weight']/10 for u,v in edges]
n_size = [v * 10 for v in pw_dict.values()]

plt.figure(figsize=(15,15),dpi=300)

nx.draw(G, 
        edges = edges, 
        width = weights, 
        edge_color = 'gray',
        # labels=g_labels, 
        # with_labels = True, font_size=6,
        # nodelist=g_labels, 
        node_size = n_size,
        node_color = n_size
        #cmap = plt.cm.YlOrRd
       )

plt.tick_params(
    axis = 'both',       # changes apply to both axes
    which = 'both',      # both major and minor ticks are affected
    bottom = 'off',      # ticks along the bottom edge are off
    top = 'off',         # ticks along the top edge are off
    labelbottom = 'off', # labels along the bottom edge are off
    left = 'off',
    labelleft = 'off')

plt.show;

This looks a lot nicer!

Now ideally, I would want to be able to detect what nodes corresponds to what pathway in a easy and interactive way. It would also be great to include some form of pathway classification that can serve as color-coding for the nodes. That would make it easier to for instance spot a cluster of inflammation-related pathways.

To make the graph interactive, I found a way to do it using the python module 'mpld3' (https://mpld3.github.io/) which combines matplotlib with D3.

Initially, the mpld3 didn't want to work with my networkx object. The error I got said: "Object of type ndarray is not JSON serializable".

To solve this issue I did the following:

  1. Install a small fix to the mpld3 module: python -m pip install --user "git+https://github.com/javadba/mpld3@display_fix"
  2. Add import networkx to ~/.local/lib/python3.7/site-packages/mpld3/_display.py
  3. Add:
    elif isinstance(obj,networkx.classes.reportviews.NodeView):
    return list(obj)
    to the default() function in class NumpyEncoder to make it work (also in _display.py)

Reference: https://stackoverflow.com/questions/47380865/json-serialization-error-using-matplotlib-mpld3-with-linkedbrush

In [61]:
import mpld3
from mpld3 import enable_notebook
enable_notebook()

Categorise pathways and visualize interactively

I also want to find out a bit more information about the most connected pathways, and I can do so by picking out the top most connected pathways and using my custom identify_pathway() function to see what pathways the pathway IDs corresponds to.

In [62]:
pw_details = list()
for pw in list(pw_dict.keys()):
    #print(pw)
    pw_details.append(identify_pathway(pw_search = pw, print_genes = False))
In [63]:
pw_details[:5]
Out[63]:
[('R-HSA-190840',
  'Microtubule-dependent trafficking of connexons from Golgi to the plasma membrane'),
 ('R-HSA-6807047', 'Cholesterol biosynthesis via desmosterol'),
 ('R-HSA-8941856', 'RUNX3 regulates NOTCH signaling'),
 ('R-HSA-180336', 'SHC1 events in EGFR signaling'),
 ('R-HSA-3928664', 'Ephrin signaling')]

Since I couldn't find any quick way to extract pathways classifications from Reactome, I came up with my own way to identify a few selected categories based on identifying certain keywords from the pathway names. The categories I decided to look for were "signaling", "immune", "metabolism", "development", "mitochondria", and "hypoxia". All other pathways not found using these keywords will be classified as "other".

In [65]:
# Example of a simple search
sub = "DNA"
[s for s in pw_details if sub in s[1]]
Out[65]:
[('R-HSA-5693532', 'DNA Double-Strand Break Repair'),
 ('R-HSA-110328',
  'Recognition and association of DNA glycosylase with site containing an affected pyrimidine'),
 ('R-HSA-69473', 'G2/M DNA damage checkpoint'),
 ('R-HSA-1834949', 'Cytosolic sensors of pathogen-associated DNA '),
 ('R-HSA-212300', 'PRC2 methylates histones and DNA'),
 ('R-HSA-5693579', 'Homologous DNA Pairing and Strand Exchange'),
 ('R-HSA-110330',
  'Recognition and association of DNA glycosylase with site containing an affected purine'),
 ('R-HSA-73894', 'DNA Repair'),
 ('R-HSA-69478', 'G2/M DNA replication checkpoint'),
 ('R-HSA-5693616',
  'Presynaptic phase of homologous DNA pairing and strand exchange'),
 ('R-HSA-2559586', 'DNA Damage/Telomere Stress Induced Senescence'),
 ('R-HSA-5334118', 'DNA methylation')]
In [66]:
# Defining pathway classes
col_list = ["#BED7D1", "#FDDFDF", "#F7EBC3", "#c1f5c5", "#c2e9f2", "#DBC4DF", "#e3b78f"]

pw_class = list()
for i in range(len(pw_details)):
    pw_id = pw_details[i][0]
    pw_name = pw_details[i][1]
    #print(pw_name)
    if ("ignaling" or "ignalling") in pw_name:
        info = [pw_id, pw_name, "Signaling", col_list[0]]
        #print(info)
        pw_class.append(info)
    elif ("mmune") in pw_name:
        info = [pw_id, pw_name, "Immune", col_list[1]]
        pw_class.append(info)
    elif ("etabolism") in pw_name:
        info = [pw_id, pw_name, "Metabolism", col_list[2]]
        pw_class.append(info)
    elif ("evelopment") in pw_name:
        info = [pw_id, pw_name, "Development", col_list[3]]
        pw_class.append(info)
    elif ("itochondri") in pw_name:
        info = [pw_id, pw_name, "Mitochondria", col_list[4]]
        pw_class.append(info)
    elif ("ypoxia") in pw_name:
        info = [pw_id, pw_name, "Hypoxia", col_list[5]]
        pw_class.append(info)
    elif ("DNA") in pw_name:
        info = [pw_id, pw_name, "DNA", col_list[6]]
        pw_class.append(info)
    else:
        info = [pw_id, pw_name, "Other", "#EDF0F0"]
        pw_class.append(info)
In [67]:
pw_class_label = [s[2] for s in pw_class]
pw_class_label_color = [s[3] for s in pw_class]
In [69]:
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=col_list[0], label='Signaling', edgecolor='w'),
                   Patch(facecolor=col_list[1], label='Immune', edgecolor='w'),
                   Patch(facecolor=col_list[2], label='Metabolism', edgecolor='w'),
                   Patch(facecolor=col_list[3], label='Development', edgecolor='w'),
                   Patch(facecolor=col_list[4], label='Mitochondria', edgecolor='w'),
                   Patch(facecolor=col_list[5], label='Hypoxia', edgecolor='w'),
                   Patch(facecolor=col_list[6], label='DNA-related', edgecolor='w'),
                   Patch(facecolor="#E4DFD9", label='Other', edgecolor='w')]

Now we can draw the network and color by pathway class, and include node labels which tells us what pathway the node corresponds to:

In [70]:
random.seed(101)
edges = G.edges()
weights = [G[u][v]['weight']/20 for u,v in edges]
n_size = [(v+1) * 2 for v in pw_dict.values()]
pos = nx.spring_layout(G)

fig, ax = plt.subplots(figsize=(8,6), dpi=100)

scatter = nx.draw_networkx_nodes(G, pos, ax = ax,
                                 #nodelist = g_labels,
                                 node_size = n_size,
                                 node_color = pw_class_label_color)

nx.draw_networkx_edges(G, pos, ax=ax,
                       edges = edges,
                       width = weights,
                       edge_color = 'gray',)

ax.legend(handles = legend_elements, loc='best')

tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=pw_details)
mpld3.plugins.connect(fig, tooltip)

As one can see from this graph, it seems that pathways belonging to the same class generally cluster together in the network. For instance, there is a small cluster of mitochondria/translation-related pathways bundled together. Also, the pathways involving metabolism can mostly be seen to share a lot of connections. Some clusters with pathways which haven't been classified ("other") also make a lot of sense, such as the cluster of pathways corresponding to various cell cycle checkpoints or the one corresponding to gap junction-related pathways. It's also not too surprising that the pathways related to signaling can be seen a bit all over the network, since it is a very vague classification and these pathways could be part in many different cellular processes.

Minimum spanning tree

I can also do the same with the minimum spanning tree instead;

In [71]:
random.seed(101)
edges = tree.edges()
weights = [tree[u][v]['weight']/10 for u,v in edges]
n_size = [v + 2 for v in pw_dict.values()]
pos = nx.spring_layout(tree)

fig, ax = plt.subplots(figsize=(8,6), dpi=100)

scatter = nx.draw_networkx_nodes(tree, pos, ax = ax,
                                 node_size = n_size,
                                 #node_size = 5,
                                 alpha = 0.85,
                                 node_color = pw_class_label_color)

nx.draw_networkx_edges(tree, pos, ax=ax,
                       edges = edges,
                       width = weights,
                       edge_color = 'gray',)

ax.legend(handles = legend_elements, loc='best')

tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=pw_details)
mpld3.plugins.connect(fig, tooltip)

Identification of "top nodes"

I'd like to see what the pathways having the highest connectivity and that are of largest size corresponds to.

To do this I'll again look at the top largest pathways and then feed them to my identify_pathway() function.

In [72]:
# look at what pathways have the most genes in them
filt_cutoff = 20
d = {k: v for k, v in pw_dict.items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True) 

for p in d:
    pw = p[0]
    #print(pw)
    print(identify_pathway(pw_search = pw, print_genes = False), "Size:", p[1], "DEGs")
('R-HSA-162582', 'Signal Transduction') Size: 68 DEGs
('R-HSA-1430728', 'Metabolism') Size: 67 DEGs
('R-HSA-168256', 'Immune System') Size: 51 DEGs
('R-HSA-392499', 'Metabolism of proteins') Size: 33 DEGs
('R-HSA-556833', 'Metabolism of lipids') Size: 31 DEGs
('R-HSA-1280215', 'Cytokine Signaling in Immune system') Size: 27 DEGs
('R-HSA-597592', 'Post-translational protein modification') Size: 24 DEGs
('R-HSA-74160', 'Gene expression (Transcription)') Size: 23 DEGs
('R-HSA-1643685', 'Disease') Size: 23 DEGs
('R-HSA-73857', 'RNA Polymerase II Transcription') Size: 23 DEGs
('R-HSA-168249', 'Innate Immune System') Size: 23 DEGs
('R-HSA-1266738', 'Developmental Biology') Size: 22 DEGs
('R-HSA-212436', 'Generic Transcription Pathway') Size: 22 DEGs
('R-HSA-109582', 'Hemostasis') Size: 21 DEGs
('R-HSA-372790', 'Signaling by GPCR') Size: 20 DEGs
In [73]:
# look at what pathways are the most connected ones
filt_cutoff = 200
d = {k: v for k, v in dict(G.degree).items() if v >= filt_cutoff}
d = sorted(d.items(), key = lambda x: x[1], reverse=True)

for p in d:
    pw = g_labels[p[0]]
    #print(pw)
    print(identify_pathway(pw_search = pw, print_genes = False), "Degree:", p[1], "edges")
('R-HSA-162582', 'Signal Transduction') Degree: 502 edges
('R-HSA-168256', 'Immune System') Degree: 372 edges
('R-HSA-1643685', 'Disease') Degree: 342 edges
('R-HSA-1430728', 'Metabolism') Degree: 313 edges
('R-HSA-1280215', 'Cytokine Signaling in Immune system') Degree: 310 edges
('R-HSA-1266738', 'Developmental Biology') Degree: 300 edges
('R-HSA-74160', 'Gene expression (Transcription)') Degree: 253 edges
('R-HSA-73857', 'RNA Polymerase II Transcription') Degree: 253 edges
('R-HSA-212436', 'Generic Transcription Pathway') Degree: 250 edges
('R-HSA-392499', 'Metabolism of proteins') Degree: 242 edges
('R-HSA-168249', 'Innate Immune System') Degree: 226 edges
('R-HSA-9006934', 'Signaling by Receptor Tyrosine Kinases') Degree: 215 edges
('R-HSA-597592', 'Post-translational protein modification') Degree: 215 edges
('R-HSA-5663202', 'Diseases of signal transduction') Degree: 213 edges
('R-HSA-109582', 'Hemostasis') Degree: 201 edges

Here we can see that the lagest and most connected pathways are quite generic pathways such as 'Signal Transduction', 'Immune system', and 'Metabolism'. These are probably not the most interesting pathways for this analysis through, since they are part of so many processes.

Project conclusions

Project outcomes in brief
I was able to process a RNA-seq dataset from count matrices until the generation of a pathway network based on differentially expressed genes. The obtained network can be used to study what pathways are more highly "enriched" between the two sample conditions; Budesonide treated and untreated lung cultures. Genes belonging to metabolic pathways, signaling pathways, immune system related pathways, and hypoxia pathways are some examples of pathways that I was able to pick up and visualize in the network. Without much further details about whether the identified pathways are activated or deactivated in the treatment group, it is hard to draw any real conclusions from these results. However, with the time limitation of this course project the scope had to be limited, and I am still happy that I managed to get this far with this kind of analysis by exclusively working in Python and performing a lot of things from scratch that I had very limited knowledge of beforehand.

Short "disclaimer" about the pathway analysis
This is not a true pathway enrichment that I've done here, since ideally you'd also want to take the pathway size (i.e. number of genes annotated within a pathway from the database) into consideration. Some pathways involve hundreds of genes, and seeing many genes from our DEG list within such pathways wouldn't be very surprising in that case, while if a pathway only has ten genes associated with it and nine of them can be found within our DEG list that would be a finding with a lot more significance. In many pathway enrichment tools, they report the percentages of pathway genes found from your given gene list as well as a statistical test, such as Fisher's exact test, where multiple-test correction is also applied.

In [74]:
import datetime
print('Last update done: ', datetime.datetime.now(), '\n',
     "Lovisa Franzén", sep='')
Last update done: 2019-10-28 21:40:10.839961
Lovisa Franzén